Avoidance of Model Re-Induction in SVM-Based Feature Selection for Text Categorization
نویسندگان
چکیده
Searching the feature space for a subset yielding optimum performance tends to be expensive, especially in applications where the cardinality of the feature space is high (e.g., text categorization). This is particularly true for massive datasets and learning algorithms with worse than linear scaling factors. Linear Support Vector Machines (SVMs) are among the top performers in the text classification domain and often work best with very rich feature representations. Even they however benefit from reducing the number of features, sometimes to a large extent. In this work we propose alternatives to exact re-induction of SVM models during the search for the optimum feature subset. The approximations offer substantial benefits in terms of computational efficiency. We are able to demonstrate that no significant compromises in terms of model quality are made and, moreover, in some cases gains in accuracy can be achieved.
منابع مشابه
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
متن کاملTwo Step POS Selection for SVM Based Text Categorization
Although many researchers have verified the superiority of Support Vector Machine (SVM) on text categorization tasks, some recent papers have reported much lower performance of SVM based text categorization methods when focusing on all types of parts of speech (POS) as input words and treating large numbers of training documents. This was caused by the overfitting problem that SVM sometimes sel...
متن کاملForesTexter: An efficient random forest algorithm for imbalanced text categorization
In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve the imbalanced text categorization problems. RF has shown great success in many real-world applications. However, the problem of learning from text data with class imbalance is a relatively new challenge that needs to be addressed. A RF algorithm tends to use a simple random sampling of features in b...
متن کاملMMR-based Feature Selection for Text Categorization
We introduce a new method of feature selection for text categorization. Our MMR-based feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results show that MMR-based feature selection is more effective than Koller & Sahami’s method, which is one of greedy feature selection ...
متن کاملOscillating Feature Subset Search Algorithm for Text Categorization
A major characteristic of text document categorization problems is the extremely high dimensionality of text data. In this paper we explore the usability of the Oscillating Search algorithm for feature/word selection in text categorization. We propose to use the multiclass Bhattacharyya distance for multinomial model as the global feature subset selection criterion for reducing the dimensionali...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007